Check the significance of a subsequence of data based on the number of runs or the length of the longest. These tests work on any data with a limited number of symbols. They are used within the Dimodal package to test peaks in the signed difference of the interval spacing, which has values -1, 0, and +1.
Dinrun.test(x, stID, endID, feps, lower.tail=TRUE)
Dirunlen.test(x, stID, endID, feps)Dinrun.test and Dirunlen.test return lists of class
"Ditest" with elements
a string describing the test
function used to evaluate significance level/probability
what is tested, the number of runs or maximum length
text string describing the statistic
other distribution arguments, for the runs test Erun
the expected count and Vrun its variance, for the maximum run length
featlen the feature length
probability of feature
for the run length test, the transition matrix of the Markov chain
for the run length test, the weight applied to starting to each state
statistic, parameter, and p.value are vectors with the
length of the stID and endID vectors.
a vector with a limited number of distinct values, including numeric, integer, character, factor, or logical
a scalar or vector of data indices of the start indices within x of the features
a scalar or vector of data indices of the end indices (incl.) of the features
closeness of tied real values, as per find.runs, and ignored for
other types
a boolean, if TRUE the test returns the probability that the run count is not more than the observed, if FALSE is more than
Dinrun.test compares the number of runs within each sequence defined
by the endpoints to the expected, based on a combinatorial counting of all
possible sequences of the symbols. The number of runs is distributed
normally, with an expected value and variance that depends on the number of
each symbol. Wolf and Wolfowitz derived formulas for these values in the
case of two symbols, and Kaplansky and Riordan generalized this to
arbitrary sets of symbols. This test implements that general version. Its
value is the probability of getting the actual number of runs or fewer, i.e.
the lower tail.
Filtering introduces correlation between symbols, which we can account for by using a Markov chain model. The length of a run can be estimated by separating the symbol transition matrix into two parts, the diagonal which generates a matching symbol and the off-diagonal elements which switch to another. A new Markov chain modeling a run uses these two sub-matrices with the advancing steps placed on the diagonal of the run's transition matrix and the resetting steps in the first column. An absorbing state, or identity matrix, captures all runs longer than that being tested. The run length probability follows from the chance of entering this absorbing state after stepping the new chain over the length of the feature. This amounts to a recursion with the sub-matrices followed weighting by the steady-state or symbol's stationary state vector to sum over all possible starting symbols. We assume the symbol's Markov chain has order one.
These tests use find.runs to determine the number of runs and longest
within the endpoints, and ignores any NA and NaN values in x. Its
feps argument is used to consider which real values are the same. NA
and NaN start or end indices generate NA statistics and probabilities, so
that the features from Dipeak and Diflat can be passed
directly. Note that if doing so, and passing the difference of the spacing
as x, stID should be increased by 1 to account for the point
lost in the difference. Numeric indices are rounded to integers, and other
types or values that are out of bounds to the vector raise errors. If the
runs are based on a discrete set of values, as they are in Dimodal, then
feps can be set to 0, otherwise the option "peak.fhtie" could be used.
The runs statistic test involves an O(n) scan of each sequence to
count the number of symbols within; there is therefore little advantage to
calling this with vectors of indices, rather than one making separate calls
one feature at a time. The longest runs test requires estimating the
transition matrix over all data, which is O(n), and this overhead
can be shared by calling it once for all features. The actual recursion
is expensive, involving O(2 l^2 n) matrix multiplications and
additions, where l is the longest run length and n the
sequence length; the matrix size equals the number of symbols (3x3 for the
signed difference).
The probabilities should be evaluated against options "alpha.nrun" and
"alpha.runlen" for the minimum passing level.
A. Wald, J. Wolfowitz (1940), On a test whether two samples are from the same population. The Annals of Mathematical Statistics 11, pp. 147--162.
I. Kaplansky and J. Riordan (1945), Multiple matching and runs by the symbolic method, The Annals of Mathematical Statistics, 16, pp. 272--277.
find.runs
## The inner diff generates the spacing, the outer the signed difference.
xrun <- sign(diff( diff(sort( iris$Petal.Width )) ))
## No epsilon needed for signed values.
Dinrun.test(xrun, 1, length(xrun), 0)
Dirunlen.test(xrun, 1, length(xrun), 0)
Run the code above in your browser using DataLab